# Deep Convolutional Networks on Graph-Structured Data

## 1 Introduction

In recent times, deep learning models have proven extremely successful on a wide variety of tasks, from computer vision and acoustic modeling to natural language processing [9]. At the core of their success lies an important assumption on the statistical properties of the data, namely stationarity and compositionality through local statistics, which are present in natural images, video, and speech. These properties are exploited efficiently by ConvNets [8, 7], which are designed to extract local features that are shared across the signal domain. Thanks to this, they are able to greatly reduce the number of parameters in the network with respect to generic deep architectures, without sacrificing the capacity to extract informative statistics from the data. Similarly, Recurrent Neural Nets (RNNs) trained on temporal data implicitly assume a stationary distribution.

One can think of such data examples as being signals defined on a low-dimensional grid. In this case stationarity is well defined via the natural translation operator on the grid, locality is defined via the metric of the grid, and compositionality is obtained from downsampling, or equivalently thanks to the multi-resolution property of the grid. However, there exist many examples of data that lack the underlying low-dimensional grid structure. For example, text documents represented as bags of words can be thought of as signals defined on a graph whose nodes are vocabulary terms and whose weights represent some similarity measure between terms, such as co-occurrence statistics. In medicine, a patient's gene expression data can be viewed as a signal defined on the graph imposed by the regulatory network. In fact, computer vision and audio, which are the main focus of research efforts in deep learning, only represent a special case of data defined on an extremely simple low-dimensional graph. Complex graphs arising in other domains might be of higher dimension, and the statistical properties of data defined on such graphs might not satisfy the stationarity, locality and compositionality assumptions previously described. For data of this type with dimension $N$, deep learning strategies are reduced to learning with fully-connected layers, which have $O(N^2)$ parameters, and regularization is carried out via weight decay and dropout [17].

When the graph structure of the input is known, [2] introduced a model that generalizes ConvNets with a learning complexity as low as that of a ConvNet, and which was demonstrated on simple low-dimensional graphs. In this work, we are interested in generalizing ConvNets to high-dimensional, general datasets, and, most importantly, to the setting where the graph structure is not known a priori. In this context, learning the graph structure amounts to estimating the similarity matrix, which has complexity $O(N^2)$. One may therefore wonder whether graph estimation followed by graph convolutions offers advantages with respect to learning directly from the data with fully connected layers. We attempt to answer this question experimentally and to establish baselines for future work.

We explore these approaches in two areas of application for which it has not been possible to apply convolutional networks before: text categorization and bioinformatics. Our results show that our method is capable of matching or outperforming large, fully-connected networks trained with dropout, while using fewer parameters. Our main contributions can be summarized as follows:

- We extend the ideas from [2] to large-scale classification problems, specifically ImageNet object recognition, text categorization and bioinformatics.
- We consider the most general setting where no prior information on the graph structure is available, and propose unsupervised and new supervised graph estimation strategies in combination with the supervised graph convolutions.

The rest of the paper is structured as follows. Section 2 reviews similar works in the literature. Section 3 discusses generalizations of convolutions on graphs, and Section 4 addresses the question of graph estimation. Finally, Section 5 shows numerical experiments on large-scale object recognition, text categorization and bioinformatics.

## 2 Related Work



Several works have explored architectures using so-called local receptive fields [6, 4, 14], mostly with applications to image recognition. In particular, [4] proposes a scheme to learn how to group together features based upon a measure of similarity that is obtained in an unsupervised fashion. However, it does not attempt to exploit any weight-sharing strategy.

Recently, [2] proposed a generalization of convolutions to graphs via the Graph Laplacian. By identifying a linear, translation-invariant operator in the grid (the Laplacian operator) with its counterpart in a general graph (the Graph Laplacian), one can view convolutions as the family of linear transforms commuting with the Laplacian. By combining this commutation property with a rule to find localized filters, the model requires only $O(1)$ parameters per "feature map". However, this construction requires prior knowledge of the graph structure, and was shown only on simple, low-dimensional graphs. More recently, [12] introduced Shapenet, another generalization of convolutions on non-Euclidean domains based on geodesic polar coordinates, which was successfully applied to shape analysis and allows comparison across different manifolds. However, it also requires prior knowledge of the manifolds.

The graph or similarity estimation aspects have also been extensively studied in the past. For instance, [15] studies the estimation of the graph from a statistical point of view, through the identification of a certain graphical model using $\ell_1$-penalized logistic regression. Also, [3] considers the problem of learning a deep architecture through a series of Haar contractions, which are learnt using an unsupervised pairing criterion over the features.



## 3 Generalizing Convolutions to Graphs

### 3.1 Spectral Networks

Our work builds upon [2], which introduced spectral networks. We recall the definition and its main properties here. A spectral network generalizes a convolutional network through the Graph Fourier Transform, which is in turn defined via a generalization of the Laplacian operator on the grid to the graph Laplacian. An input vector $x \in \mathbb{R}^N$ is seen as a signal defined on a graph $G$ with $N$ nodes.

Note that when the similarity matrix $W$ represents the lattice, the graph Laplacian $L$ reduces by definition to the discrete Laplacian operator $\Delta$. Also note that the Laplacian commutes with the translation operator, which is diagonalized in the Fourier basis. It follows that the eigenvectors of $\Delta$ are given by the Discrete Fourier Transform (DFT) matrix. We then recover a classical convolution operator by noting that convolutions are by definition linear operators that diagonalize in the Fourier domain (also known as the Convolution Theorem [11]).
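The lattice case can be checked numerically. The sketch below is illustrative only: the definition of $L$ is not reproduced in these notes, so it assumes the normalized form $L = I - D^{-1/2} W D^{-1/2}$ (with $D$ the degree matrix) and uses a 1-D periodic lattice; the function names are hypothetical.

```python
import numpy as np

def ring_adjacency(n):
    """Adjacency matrix of a 1-D periodic lattice (ring) with n >= 3 nodes."""
    W = np.zeros((n, n))
    idx = np.arange(n)
    W[idx, (idx + 1) % n] = 1.0
    W[idx, (idx - 1) % n] = 1.0
    return W

def normalized_laplacian(W):
    """Assumed definition: L = I - D^{-1/2} W D^{-1/2}, D = degree matrix."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    return np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]

n = 8
L = normalized_laplacian(ring_adjacency(n))

# Unitary DFT matrix: column k is the k-th Fourier mode.
F = np.fft.fft(np.eye(n)) / np.sqrt(n)

# On the ring, Fourier mode k is an eigenvector of L
# with eigenvalue 1 - cos(2*pi*k/n).
eigvals = 1.0 - np.cos(2.0 * np.pi * np.arange(n) / n)
residual = np.abs(L @ F - F * eigvals[None, :]).max()
print(f"max |L F - F diag(lambda)| = {residual:.2e}")
```

The residual is at the level of floating-point round-off, which is the sense in which the DFT matrix diagonalizes the lattice Laplacian.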

Learning filters on a graph therefore amounts to learning spectral multipliers.

Extending the convolution to inputs $x$ with multiple input channels is straightforward. If $x$ is a signal with $M$ input channels and $N$ locations, we apply the transformation $U$ on each channel, and then use multipliers $w_g = (w_{i,j};\ i \le N,\ j \le M)$.
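A minimal sketch of this multi-channel spectral filtering follows. Several details are assumptions made for illustration: the Laplacian is taken in its normalized form, the eigenvectors are stored as columns (so the forward transform is written `U.T @ x`), and the channels are summed to produce one output feature map, as in an ordinary convolution layer. The function names are hypothetical.

```python
import numpy as np

def graph_fourier_basis(W):
    """Eigenvectors of the (assumed) normalized Laplacian, stored as columns."""
    d_inv_sqrt = 1.0 / np.sqrt(W.sum(axis=1))
    L = np.eye(len(W)) - d_inv_sqrt[:, None] * W * d_inv_sqrt[None, :]
    _, U = np.linalg.eigh(L)  # orthonormal because L is symmetric
    return U

def spectral_conv(x, U, w):
    """x: (N, M) signal, U: (N, N) basis, w: (N, M) multipliers for one map g."""
    x_hat = U.T @ x                   # graph Fourier transform of each channel
    y_hat = (w * x_hat).sum(axis=1)   # apply multipliers, sum over channels
    return U @ y_hat                  # back to the node domain, shape (N,)

rng = np.random.default_rng(0)
N, M = 6, 3
W = rng.random((N, N))
W = (W + W.T) / 2.0                   # symmetric similarity matrix
np.fill_diagonal(W, 0.0)

U = graph_fourier_basis(W)
x = rng.normal(size=(N, M))
y = spectral_conv(x, U, np.ones((N, M)))  # all-ones multipliers act as identity
```

With all multipliers set to one, the layer reduces to summing the input channels, which is a convenient sanity check on the transform pair.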

However, convolutional kernels for each feature map $g$ are typically restricted to have small spatial support, independent of the number of input pixels $N$, which enables the model to learn a number of parameters independent of $N$. In order to recover a similar learning complexity in the spectral domain, it is thus necessary to restrict the class of spectral multipliers to those corresponding to localized filters.

For that purpose, we seek to express spatial localization of filters in terms of their spectral multipliers. In the grid, smoothness in the frequency domain corresponds to spatial decay.
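One way to picture this restriction is to generate the $N$ multipliers from a small number $k$ of learnable values through a fixed smoothing matrix, so that the parameter count no longer depends on $N$. The sketch below uses plain linear interpolation as a stand-in; the interpolation scheme, the function name and the sizes are assumptions for illustration, not the construction used in the paper.

```python
import numpy as np

def interpolation_kernel(n, k):
    """(n, k) matrix K: K @ theta linearly interpolates k control points
    onto n evenly spaced spectral positions."""
    knots = np.linspace(0.0, 1.0, k)
    grid = np.linspace(0.0, 1.0, n)
    K = np.empty((n, k))
    for j in range(k):
        basis = np.zeros(k)
        basis[j] = 1.0
        K[:, j] = np.interp(grid, knots, basis)  # j-th hat function
    return K

n, k = 200, 10
K = interpolation_kernel(n, k)

rng = np.random.default_rng(0)
theta = rng.normal(size=k)   # the only learnable parameters (k of them)
w = K @ theta                # n smooth spectral multipliers
```

Only `theta` would be trained; `K` stays fixed, so each filter costs `k` parameters regardless of `n`.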

## 4 Graph Construction

Whereas some recognition tasks in non-Euclidean domains, such as those considered in [2] or [12], might have prior knowledge of the graph structure of the input data, many other real-world applications do not have such knowledge. It is thus necessary to estimate a similarity matrix $W$ from the data before constructing the spectral network. In this paper we consider two possible graph constructions: an unsupervised one that measures joint feature statistics, and a supervised one that uses an initial network as a proxy for the estimation.

### 4.1 Unsupervised

Given data $X \in \mathbb{R}^{L \times N}$, where $L$ is the number of samples and $N$ the number of features, the simplest approach to estimating a graph structure from the data is to consider the Euclidean distance between features $i$ and $j$, that is, between the columns $X_i$ and $X_j$ of $X$.

While correlations are typically sufficient to reveal the intrinsic geometrical structure of images [16], the effects of higher-order statistics might be non-negligible in other contexts, especially in the presence of sparsity. Indeed, in many situations the pairwise Euclidean distances might suffer from unnormalized measurements. Several strategies and variants exist to gain some robustness, for instance replacing the Euclidean distance by the Z-score (thus renormalizing each feature by its standard deviation), the "square-correlation" (computing the correlation of squares of previously whitened features), or the mutual information.

This distance is then used to build a Gaussian diffusion kernel [1].
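The displayed formulas for the distance and the kernel did not survive in these notes, so the following is a hedged reconstruction rather than the paper's exact definition. It assumes a squared Euclidean distance between feature columns and a kernel of the form $\exp(-d(i,j)/\sigma^2)$; the function names and the choice of bandwidth are hypothetical.

```python
import numpy as np

def pairwise_feature_distance(X):
    """Squared Euclidean distance between columns (features) of X, X: (L, N)."""
    sq_norms = (X ** 2).sum(axis=0)
    D = sq_norms[:, None] + sq_norms[None, :] - 2.0 * (X.T @ X)
    return np.maximum(D, 0.0)  # clamp tiny negatives caused by round-off

def gaussian_kernel(D, sigma):
    """Assumed form: W[i, j] = exp(-D[i, j] / sigma^2)."""
    return np.exp(-D / sigma ** 2)

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))      # 100 samples, 5 features
D = pairwise_feature_distance(X)
sigma = np.sqrt(np.median(D))      # illustrative bandwidth choice
W = gaussian_kernel(D, sigma)      # (5, 5) similarity matrix
```

Standardizing each column of `X` before calling `pairwise_feature_distance` would give the Z-score variant mentioned above.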

In our experiments, we also consider the self-tuning diffusion kernel variant [21], in which the bandwidth $\sigma_i$ is computed as the distance $d(i, i_k)$ to the $k$-th nearest neighbor $i_k$ of feature $i$. This defines a kernel whose variance is locally adapted around each feature point, as opposed to (1), where the variance is shared.
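A sketch of a locally scaled kernel is given below. The exact formula is missing from these notes, so the code assumes the common local-scaling form $\exp\!\big(-d(i,j)^2 / (\sigma_i \sigma_j)\big)$ with $d$ the (unsquared) Euclidean distance between feature columns; the function names are hypothetical.

```python
import numpy as np

def pairwise_feature_dist(X):
    """Euclidean distance between columns (features) of X, X: (L, N)."""
    Xt = X.T
    diff = Xt[:, None, :] - Xt[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def self_tuning_kernel(D, k):
    """Assumed form: W[i, j] = exp(-D[i, j]^2 / (sigma_i * sigma_j))."""
    # Each sorted row starts with the self-distance 0, so index k is the
    # distance to the k-th nearest neighbor, excluding the feature itself.
    sigma = np.sort(D, axis=1)[:, k]
    return np.exp(-(D ** 2) / (sigma[:, None] * sigma[None, :]))

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 6))       # 50 samples, 6 features
D = pairwise_feature_dist(X)
W = self_tuning_kernel(D, k=2)     # (6, 6) similarity matrix
```

Because each feature carries its own `sigma`, features in dense regions of the feature space get narrower kernels than isolated ones.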

The main advantage of (1) is that it does not require labeled data. Therefore, it is possible to estimate the similarity using several datasets that share the same features, for example in text classification.

### 4.2 Supervised

As discussed in the previous section, the notion of feature similarity is not well defined, as it depends on our choice of kernel and criteria. Therefore, in the context of supervised learning, the relevant statistics from the input signals might not correspond to our imposed similarity criteria. It may thus be interesting to ask for the feature similarity that best suits a particular classification task.

A particularly simple approach is to use a fully-connected network to determine the feature similarity. Given a training set with normalized features $X \in \mathbb{R}^{L \times N}$ and labels $y \in \{1, \dots, C\}^L$, we initially train a fully connected network $\phi$ with $K$ layers of weights $W_1, \dots, W_K$, using standard ReLU activations and dropout. We then extract the first-layer features $W_1 \in \mathbb{R}^{N \times M_1}$, where $M_1$ is the number of first-layer hidden features, and consider a distance between features computed from $W_1$, which is then fed into the Gaussian kernel as in (1).

The interpretation is that the supervised criterion will extract through $W_1$ a collection of linear measurements that best serve the classification task. Thus two features are similar if the network decides to use them similarly within these linear measurements.
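The idea can be sketched as follows. The distance formula itself is not reproduced in these notes, so the code assumes the squared Euclidean distance between rows of $W_1$ (row $i$ holds the weights connecting input feature $i$ to the hidden units). A random matrix stands in for trained weights, and the function name is hypothetical.

```python
import numpy as np

def supervised_kernel(W1, sigma):
    """W1: (N, M1) first-layer weights. Returns an (N, N) similarity matrix,
    assuming squared row distances and exp(-d / sigma^2)."""
    diff = W1[:, None, :] - W1[None, :, :]
    D = (diff ** 2).sum(axis=-1)
    return np.exp(-D / sigma ** 2)

rng = np.random.default_rng(0)
N, M1 = 5, 16
W1 = rng.normal(size=(N, M1))   # stand-in for weights of a trained network
W1[1] = W1[0]                   # features 0 and 1 are used identically

K = supervised_kernel(W1, sigma=4.0)
```

Features 0 and 1 receive identical weights in this toy example, so their similarity is maximal, matching the interpretation that features are similar when the network uses them similarly.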

This construction can be seen as "distilling" the information learnt by a first network into a kernel. In the general case where no assumptions are made on the dimension of the graph, it amounts to extracting $N^2$ parameters. If, moreover, we assume a low-dimensional graph structure of dimension $m$, then $mN$ parameters are extracted by projecting the resulting kernel onto its leading $m$ directions.
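As an illustration of the low-dimensional case, the sketch below interprets "leading $m$ directions" as the top-$m$ eigenvectors of the symmetric kernel. That interpretation, the function name and the toy kernel are assumptions.

```python
import numpy as np

def leading_directions(K, m):
    """Top-m eigenpairs of a symmetric kernel K, largest eigenvalues first."""
    vals, vecs = np.linalg.eigh(K)           # eigenvalues in ascending order
    order = np.argsort(vals)[::-1][:m]
    return vals[order], vecs[:, order]

rng = np.random.default_rng(0)
N, m = 20, 3
P = rng.normal(size=(N, 4))
D = ((P[:, None, :] - P[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-D / 4.0)                         # toy (N, N) similarity kernel

vals, V = leading_directions(K, m)
K_m = (V * vals[None, :]) @ V.T              # rank-m approximation of K
n_params = V.size                            # m * N retained parameters
```

The full kernel has `N * N` entries, while the projection keeps only the `m * N` entries of `V` (plus `m` eigenvalues).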

Finally, observe that one could simply replace the eigenbasis $U$ obtained by diagonalizing the graph Laplacian by an arbitrary unitary matrix, which is then optimized by back-propagation together with the rest of the parameters of the model. We do not report results on this strategy, although we point out that it has the same learning complexity as the fully connected network (requiring $O(KN^2)$ parameters, where $K$ is the number of layers and $N$ is the input dimension).

## 5 Experiments

In order to measure the performance of spectral networks on real-world data and to explore the effect of the graph estimation procedure, we conducted experiments on three datasets from text categorization, computational biology and computer vision. All experiments were done using the Torch machine learning environment with a custom CUDA backend.

We based the spectral network architecture on that of a classical convolutional network, namely by interleaving graph convolution, ReLU and graph pooling layers, and ending with one or more fully connected layers. As noted above, training a spectral network requires an $O(N^2)$ matrix multiplication for each input and output feature map to perform the Graph Fourier Transform, compared to the efficient $O(N \log N)$ Fast Fourier Transform used in classical ConvNets. We found training spectral networks with large numbers of feature maps to be very time-consuming, and therefore chose to experiment mostly with architectures with fewer feature maps and smaller pool sizes. We found that performing pooling at the beginning of the network was especially important to reduce the dimensionality in the graph domain and mitigate the cost of the expensive Graph Fourier Transform operation.
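A back-of-the-envelope comparison makes the cost argument concrete. The counts below are orders of magnitude only (constants are ignored), and the stride of 4 is a hypothetical value chosen for illustration.

```python
import math

def gft_cost(n):
    """Order of operations for a dense N x N transform of one feature map."""
    return n * n

def fft_cost(n):
    """Order of operations for an FFT on n points."""
    return n * math.log2(n)

n = 2000
ratio = gft_cost(n) / fft_cost(n)         # dense transform vs FFT

stride = 4                                 # hypothetical early pooling stride
pooled_nodes = n // stride
saving = gft_cost(n) / gft_cost(pooled_nodes)

print(f"GFT/FFT cost ratio at N={n}: {ratio:.0f}x")
print(f"Pooling to {pooled_nodes} nodes cuts the GFT cost by {saving:.0f}x")
```

Because the transform cost is quadratic in the number of nodes, pooling by a factor of `s` early in the network reduces the cost of every later transform by roughly `s * s`.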

In this section we adopt the following notation to describe network architectures: GC$k$ denotes a graph convolution layer with $k$ feature maps, P$k$ denotes a graph pooling layer with stride $k$ and pool size $2k$, and FC$k$ denotes a fully connected layer with $k$ hidden units. In our results we also denote the number of free parameters in the network by $P_{\text{net}}$ and the number of free parameters when estimating the graph by $P_{\text{graph}}$.

### 5.1 Reuters

We used the Reuters dataset described in [18], which consists of training and test sets each containing 201,369 documents from 50 mutually exclusive classes. Each document is represented as a log-normalized bag of words for 2000 common non-stop words. As a baseline we used the fully-connected network of [18] with two hidden layers consisting of 2000 and 1000 hidden units regularized with dropout.
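For scale, the size of this baseline can be worked out from the numbers above. The sketch assumes a 2000-dimensional input, a 50-way output layer matching the 50 classes, and one bias per unit; the helper name is hypothetical.

```python
def dense_param_count(layer_sizes, include_bias=True):
    """Number of parameters in a stack of fully connected layers."""
    total = 0
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        total += n_in * n_out          # weight matrix
        if include_bias:
            total += n_out             # one bias per output unit
    return total

baseline = [2000, 2000, 1000, 50]      # input, two hidden layers, classes
n_weights = dense_param_count(baseline, include_bias=False)
n_params = dense_param_count(baseline)
print(n_weights, n_params)
```

Under these assumptions the baseline has about six million weights, dominated by the first $2000 \times 2000$ layer, which is the $O(N^2)$ term discussed in the introduction.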

We chose hyperparameters by performing initial experiments on a validation set consisting of one-tenth of the training data. Specifically, we set the number of subsampled weights to $k = 60$, the learning rate to 0.01, and used max pooling rather than average pooling. We also found that using AdaGrad [5] made training faster. All architectures were then trained using the same hyperparameters. Since the experiments were computationally expensive, we did not train all models until full convergence. This enabled us to explore more model architectures and obtain a clearer understanding of the effects of graph construction.
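For reference, a minimal NumPy sketch of the standard AdaGrad update with the learning rate quoted above. This is the textbook rule, not the authors' Torch code; the `eps` constant and the function name are assumptions.

```python
import numpy as np

def adagrad_step(theta, grad, accum, lr=0.01, eps=1e-8):
    """One AdaGrad update: per-parameter step scaled by the root of the
    running sum of squared gradients."""
    accum = accum + grad ** 2
    theta = theta - lr * grad / (np.sqrt(accum) + eps)
    return theta, accum

theta = np.zeros(2)
accum = np.zeros(2)
grad = np.array([2.0, -4.0])

theta, accum = adagrad_step(theta, grad, accum)
first = theta.copy()                  # after one step: about [-0.01, 0.01]
theta, accum = adagrad_step(theta, grad, accum)
```

On the first step every parameter moves by roughly `lr` regardless of gradient magnitude; later steps shrink as the accumulated squared gradient grows.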

To tackle these challenges, tremendous efforts have been made in this area, resulting in a rich literature of related papers and methods. The adopted architectures and training strategies also vary greatly, ranging from supervised to unsupervised and from convolutional to recursive. However, to the best of our knowledge, little effort has been made to systematically summarize the differences and connections between these diverse methods.